Abstract: High Performance Embedded Computing (HPEC) increasingly leverages
technology from commercial high performance computing systems. To date, HPEC has
tapped only the lower end of commercial high performance computing technology. As more of the
advanced commercial technology moves into the embedded space, it presents a unique
opportunity to change the fundamentals of how HPEC solutions are addressed.

Within HPEC, two types of application specific processing elements, reconfigurable and custom,
are being used. Reconfigurable elements are built from Field Programmable Gate Array
(FPGA) technology, and custom elements are built from various devices such as Digital Signal
Processors (DSP), DARPA Polymorphic Computing Architectures (PCA) [1] and others. The
integration of these devices presents significant challenges both to the system architecture and to
the programming models. This presentation will describe a set of system requirements and
methods that not only accommodate these application specific processing devices but also allow
them to scale effectively.


Introduction  and  System  Architectural  Review 

SGI’s ccNUMA (cache coherent non-uniform memory architecture) [2] global shared memory
system architecture is the basis for our general-purpose Origin and Altix HPC systems. The
presentation will explore the architectural features used within both Origin and Altix systems
that allow scaling in excess of one thousand commercial high performance processors. The
architecture of SGI’s systems that included 192 custom application specific processors
(Tensor Processor Units) and 128 general-purpose processors will also be described.


System  Architectural  Features  for  Scalability 

High-performance FPGAs represent an important architectural tool to provide total system
performance while keeping both power and space requirements to a minimum. The use of FPGA
elements has been limited to only the most computationally dense algorithms by the amount of
data that can be passed through the device. Even with these limitations, significant
performance and power improvements have been demonstrated [3]. This presentation will
describe methods to increase the bandwidth to these devices and the communication primitives
required to scale to hundreds of devices without sacrificing performance.
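
The bandwidth limit described above can be illustrated with a simple bound: the sustained rate of a device is capped by its peak compute rate and also by its I/O bandwidth multiplied by the number of operations performed per byte moved. The sketch below is only an illustration of that reasoning. The function name and every numeric value are hypothetical assumptions, not figures from the systems described here.

```python
def attainable_ops_per_s(peak_ops_per_s, io_bytes_per_s, ops_per_byte):
    """Upper bound on sustained throughput of an I/O-fed accelerator.

    All three inputs are hypothetical; none come from the source.
    """
    # The device cannot compute faster than its input data arrives.
    io_bound = io_bytes_per_s * ops_per_byte
    return min(peak_ops_per_s, io_bound)


# Hypothetical device: 100 Gops/s peak, fed at 1 GB/s.
PEAK = 100e9
BANDWIDTH = 1e9

# A sparse kernel (2 ops per byte) is limited by I/O ...
sparse = attainable_ops_per_s(PEAK, BANDWIDTH, ops_per_byte=2)
# ... while a dense kernel (500 ops per byte) reaches the compute peak.
dense = attainable_ops_per_s(PEAK, BANDWIDTH, ops_per_byte=500)

print(f"sparse kernel: {sparse / 1e9:.0f} Gops/s")
print(f"dense kernel:  {dense / 1e9:.0f} Gops/s")
```

Under this model, raising the bandwidth to the device lowers the density at which an algorithm stops being I/O bound, so less dense algorithms become worth offloading.
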

Systems that include hundreds of application specific processing devices pose a significant
software challenge. This presentation will describe how to effectively allocate, manage, and
decommission various application specific processing elements under changing workloads.
Both software methods and APIs for interfacing with application specific processing elements
targeted at HPEC applications will be presented.
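
The allocate, manage, and decommission cycle mentioned above can be sketched as a small device pool. This is a minimal illustration under stated assumptions: the `DevicePool` class, its method names, and the `fpga0`-style device identifiers are all hypothetical and are not taken from any SGI API.

```python
from dataclasses import dataclass, field


@dataclass
class DevicePool:
    """Hypothetical pool tracking free devices and the job that owns each busy one."""
    free: set = field(default_factory=set)
    busy: dict = field(default_factory=dict)  # device id -> job name

    def add(self, device_id):
        # Bring a new device into service.
        self.free.add(device_id)

    def allocate(self, job, count):
        # Refuse the request outright if it cannot be met in full.
        if count > len(self.free):
            raise RuntimeError(f"only {len(self.free)} devices free")
        granted = sorted(self.free)[:count]
        for dev in granted:
            self.free.remove(dev)
            self.busy[dev] = job
        return granted

    def release(self, job):
        # Return every device held by the job to the free set.
        for dev in [d for d, j in self.busy.items() if j == job]:
            del self.busy[dev]
            self.free.add(dev)

    def decommission(self, device_id):
        # Only idle devices may be taken out of service.
        if device_id in self.busy:
            raise RuntimeError(f"{device_id} is in use by {self.busy[device_id]}")
        self.free.discard(device_id)


pool = DevicePool()
for i in range(4):
    pool.add(f"fpga{i}")

first = pool.allocate("filter", 3)   # workload grows
pool.release("filter")               # workload shrinks
pool.decommission("fpga3")           # retire an idle device
second = pool.allocate("fft", 2)     # a different job reuses the pool
```

Refusing to decommission a busy device keeps a running job from losing hardware underneath it; a real system would also need a policy for draining or migrating work, which this sketch leaves out.
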

Multi-Paradigm Computing Ultraviolet


Terascale to Petascale Data Set:
Bring Function to Data

Software


•Provide  for  HDL  modules 

Integrated  environment  with  debugger 
Highest  performance 

•Leverage  3rd  Party  Std  Language  Tools 

Celoxica, Impulse Acceleration, Mitrion, Mentor Graphics

•Developed  an  FPGA  aware  version  of  GDB 

Capable  of  debugging  the  FPGA  and  System  Software 
Capable  of  multiple  CPUs  and  multiple  FPGAs 

•Developed  RASC  Abstraction  Layer  (RASCAL) 

Abstraction Layer: Algorithm API

The Abstraction Layer’s algorithm API mirrors the COP
API with a few additions that enable wide scaling
and deep scaling.
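
The slide does not define the two scaling modes. The sketch below assumes that wide scaling means running the same algorithm on several devices over disjoint slices of the data, and that deep scaling means chaining devices so each stage's output feeds the next. The function names are hypothetical, plain Python functions stand in for FPGA algorithms, and nothing here reproduces the actual RASCAL or COP API.

```python
def wide_scale(algorithm, data, n_devices):
    """Split data into contiguous chunks and run the same algorithm on each."""
    if not data:
        return []
    chunk = -(-len(data) // n_devices)  # ceiling division
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    results = []
    for part in parts:  # each iteration stands in for one device
        results.extend(algorithm(part))
    return results


def deep_scale(stages, data):
    """Pass data through a chain of algorithms, one stage per device."""
    for stage in stages:
        data = stage(data)
    return data


def scale_by_two(xs):
    return [2 * x for x in xs]


def add_one(xs):
    return [x + 1 for x in xs]


samples = list(range(8))
wide = wide_scale(scale_by_two, samples, n_devices=4)
deep = deep_scale([scale_by_two, add_one], samples)

print(wide)
print(deep)
```

In this reading, wide scaling raises throughput for one algorithm by adding devices side by side, while deep scaling builds a pipeline of different algorithms across devices.
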

Hardware


•Direct Connection to NUMAlink4

6.4 GB/s per connection

•Fast  System  Level  Reprogramming  of  FPGA 
•Atomic  Memory  Operations 

Same  set  as  System  CPUs 

•Hardware  Barriers 

•Configurations  to  8191 NUMA/FPGA  connections 

Architectural Challenges


•  Ease  of  Use

-  Languages 

-  Compilers 

-  Debuggers 

-  APIs 

•  Performance 

-  Bandwidth  to/from  System 

-  Scalability 